
    Multi-GPU Graph Analytics

    We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Reusing the single-GPU implementations directly, our design requires programmers to specify only a few algorithm-dependent concerns, hiding most multi-GPU implementation details. We analyze the theoretical and practical limits to scalability in the context of varying graph primitives and datasets. We describe several optimizations, such as direction-optimizing traversal and a just-enough memory-allocation scheme, for better performance and smaller memory consumption. Compared to previous work, we achieve best-of-class performance across operations and datasets, including excellent strong and weak scalability on most primitives as we increase the number of GPUs in the system.
    Comment: 12 pages. Final version submitted to IPDPS 201
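The core idea of extending a single-GPU traversal to multiple devices, where each device owns a vertex partition, advances its local frontier slice, and exchanges the neighbors that belong to other devices, can be sketched sequentially in Python. The block partitioning, two-device setup, and `multi_dev_bfs` name are illustrative assumptions, not the library's API:

```python
def multi_dev_bfs(graph, n, source, num_devs=2):
    """Simulated multi-device BFS: each 'device' owns a contiguous vertex
    range, advances its local slice of the frontier, and exchanges the
    neighbors that fall on other devices at the end of each iteration."""
    owner = lambda v: v * num_devs // n  # block partition (an assumption)
    depth = {source: 0}
    # One local frontier per device.
    frontiers = [[] for _ in range(num_devs)]
    frontiers[owner(source)].append(source)
    level = 0
    while any(frontiers):
        level += 1
        outgoing = [[] for _ in range(num_devs)]  # messages routed per device
        for dev in range(num_devs):
            for v in frontiers[dev]:
                for w in graph[v]:
                    outgoing[owner(w)].append(w)  # send w to its owner
        # "All-to-all exchange": each device filters the vertices it owns.
        frontiers = [[] for _ in range(num_devs)]
        for dev in range(num_devs):
            for w in outgoing[dev]:
                if w not in depth:
                    depth[w] = level
                    frontiers[dev].append(w)
    return depth
```

The only per-iteration communication in this scheme is the exchange of frontier vertices, which is what makes the single-GPU algorithm body reusable.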

    Gunrock: GPU Graph Analytics

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. Gunrock, our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures across a wide range of graph primitives that span traversal-based algorithms, ranking algorithms, triangle counting, and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order-of-magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.
    Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"
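The frontier-centric, bulk-synchronous pattern the abstract describes can be illustrated with a minimal sequential Python sketch of BFS as alternating advance and filter steps. The operator names and structure here are a simplified illustration, not Gunrock's actual CUDA interface:

```python
def advance(graph, frontier):
    """Advance operator: expand the current frontier along outgoing edges."""
    out = []
    for v in frontier:
        out.extend(graph[v])
    return out

def filter_frontier(candidates, visited):
    """Filter operator: drop already-visited vertices and mark the rest."""
    next_frontier = []
    for v in candidates:
        if v not in visited:
            visited.add(v)
            next_frontier.append(v)
    return next_frontier

def bfs(graph, source):
    """Bulk-synchronous BFS: each iteration is one advance + one filter,
    with an implicit barrier between frontiers."""
    visited = {source}
    frontier = [source]
    depth = {source: 0}
    level = 0
    while frontier:
        level += 1
        candidates = advance(graph, frontier)
        frontier = filter_frontier(candidates, visited)
        for v in frontier:
            depth[v] = level
    return depth
```

Many other primitives fit the same shape by swapping in different per-edge and per-vertex functors, which is what keeps new primitives small in code size.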

    Performance Characterization of High-Level Programming Models for GPU Graph Analytics

    We identify several factors that are critical to high-performance GPU graph analytics: efficient building-block operators, synchronization and data movement, workload distribution and load balancing, and memory access patterns. We analyze the impact of these critical factors through three GPU graph analytics frameworks: Gunrock, MapGraph, and VertexAPI2. We also examine their effect on different workloads: four common graph primitives from multiple graph application domains, evaluated on real-world and synthetic graphs. We show that efficient building-block operators enable more powerful operations for fast information propagation and result in fewer device kernel invocations, less data movement, and fewer global synchronizations, and thus are key focus areas for efficient large-scale graph analytics on the GPU.

    Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

    We present Distributed Equivalent Substitution (DES) training, a novel distributed training framework for large-scale recommender systems with dynamic sparse features. DES introduces fully synchronous training to large-scale recommendation systems for the first time by reducing communication, making the training of commercial recommender systems converge faster and reach better CTR. DES requires much less communication by substituting the weights-rich operators with computationally equivalent sub-operators and aggregating partial results instead of transmitting the huge sparse weights directly through the network. Due to the use of synchronous training on large-scale Deep Learning Recommendation Models (DLRMs), DES achieves higher AUC (Area Under the ROC Curve). We successfully apply DES training to multiple popular DLRMs in industrial scenarios. Experiments show that our implementation outperforms state-of-the-art PS-based training frameworks, achieving up to 68.7% communication savings and higher throughput compared to other PS-based recommender systems.
    Comment: Accepted by SIGIR '2020. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 202
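The substitution idea, transmitting small, computationally equivalent partial results instead of the large sparse weights themselves, can be sketched for a sum-pooled embedding lookup. The parity-based sharding, table sizes, and function names below are hypothetical, chosen only to show the equivalence:

```python
import random

random.seed(0)
DIM = 4
# Hypothetical embedding table, sharded across two workers by feature-ID parity.
table = {fid: [random.random() for _ in range(DIM)] for fid in range(10)}
shard0 = {fid: v for fid, v in table.items() if fid % 2 == 0}
shard1 = {fid: v for fid, v in table.items() if fid % 2 == 1}

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def partial_pool(shard, batch_ids):
    """Sum-pool only the embeddings this worker owns: the partial result is
    one DIM-sized vector, no matter how large the sharded table is."""
    out = [0.0] * DIM
    for fid in batch_ids:
        if fid in shard:
            out = vec_add(out, shard[fid])
    return out

batch = [1, 2, 2, 7]
# Aggregating one small partial vector per worker substitutes for
# transmitting the sparse weights over the network.
pooled = vec_add(partial_pool(shard0, batch), partial_pool(shard1, batch))

# Centralized reference: look up every embedding from the full table.
reference = [0.0] * DIM
for fid in batch:
    reference = vec_add(reference, table[fid])
```

Because addition commutes across shards, the aggregated partials equal the centralized lookup-and-pool result, while each worker sends only a `DIM`-sized vector.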

    Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs

    Connected components and spanning forest are fundamental graph algorithms due to their use in many important applications, such as graph clustering and image segmentation. GPUs are an ideal platform for graph algorithms due to their high peak performance and memory bandwidth. While several GPU connectivity algorithms exist in the literature, many design choices have not yet been explored. In this paper, we explore various design choices in GPU connectivity algorithms, including sampling, linking, and tree compression, for both the static and the incremental setting. Our design choices lead to over 300 new GPU implementations of connectivity, many of which outperform the state of the art. We present an experimental evaluation and show that we achieve an average speedup of 2.47x over existing static algorithms. In the incremental setting, we achieve a throughput of up to 48.23 billion edges per second. Compared to state-of-the-art CPU implementations on a 72-core machine, we achieve a speedup of 8.26--14.51x for static connectivity and 1.85--13.36x for incremental connectivity using a Tesla V100 GPU.

    Gunrock: A Programming Model and Implementation for Graph Analytics on Graphics Processing Units

    The modern Graphics Processing Unit is high-performance, highly parallel, and fully programmable; its high memory bandwidth, computing power, excellent peak throughput, and energy efficiency bring acceleration to regular applications that have extensive data parallelism, regular memory access patterns, and modest synchronization. For graph analytics, however, the inherent irregularity of graph data structures leads to irregularity in data access and control flow, making efficient graph analytics on GPUs a significant challenge. Despite some promising specialized GPU graph algorithm implementations, parallel graph analytics on the GPU in general still faces two major challenges. The first is the programmability gap between low-level implementations of specific graph primitives and a general graph processing system. Programming graph algorithms on GPUs is difficult even for the most skilled programmers, and specialized GPU graph algorithm implementations do not generalize well because they often couple a specific graph computation to a specific type of parallel graph operation. The second is the lack of a GPU-specific graph processing programming model. High-level GPU programming models for graph analytics often recapitulate CPU programming models and do not compare favorably in performance with specialized implementations due to the overhead of maintaining a high-level framework. This dissertation seeks to resolve the conflict between programmability and performance for graph analytics on the GPU by designing a GPU-specific graph processing programming model and building a graph analytics system on the GPU that not only allows quick prototyping of new graph primitives but also delivers the performance of customized, complex GPU hardwired graph primitives.
To achieve this goal, we present a novel data-centric abstraction for graph operations that allows programmers to develop graph primitives at a high level of abstraction while simultaneously delivering high performance. It does so by incorporating into the core of our implementation several profitable optimizations that previously were applied only to individual graph algorithm implementations on the GPU: kernel fusion, push-pull traversal, idempotent traversal, priority queues, and various workload-mapping strategies. We design and implement a new graph analytics system, Gunrock, which contains a set of simple and flexible graph operation APIs that can express a wide range of graph primitives at a high level of abstraction. Using Gunrock, we implement a large set of graph primitives that span traversal-based algorithms, ranking algorithms, triangle counting, and bipartite-graph-based algorithms. All of our graph primitives achieve performance comparable to their hardwired counterparts and significantly outperform previous programmable GPU abstractions.
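Of the optimizations listed above, push-pull (direction-optimizing) traversal can be sketched sequentially: a small frontier scatters along out-edges (push), while a large one lets every unvisited vertex scan its in-edges (pull). The switching threshold and function names here are illustrative assumptions:

```python
def expand_step(graph, rgraph, frontier, visited, n):
    """One traversal iteration that chooses between push (scatter from the
    frontier) and pull (unvisited vertices scan their in-edges) based on
    frontier size -- the push-pull heuristic, with a made-up threshold."""
    if len(frontier) < n // 4:  # small frontier: push along out-edges
        nxt = set()
        for v in frontier:
            for w in graph[v]:
                if w not in visited:
                    nxt.add(w)
    else:  # large frontier: pull -- each unvisited vertex checks in-edges
        fset = set(frontier)
        nxt = set()
        for w in range(n):
            if w not in visited and any(u in fset for u in rgraph[w]):
                nxt.add(w)
    visited |= nxt
    return sorted(nxt)
```

Both branches produce the same next frontier; the payoff on real hardware is that the pull direction avoids scattering from a huge frontier into mostly-visited neighbors, at the cost of needing the reversed adjacency (`rgraph`).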